Spark 4.1 | 4.0 | 3.5 | 3.4: Fail publish_changes procedure if there's more than one matching snapshot (#14955)
Conversation
Force-pushed from 0699a5f to 7ed4b84.
spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/procedures/PublishChangesProcedure.java
```java
throw new ValidationException(
    "Cannot apply non-unique WAP ID. Found %d snapshots with WAP ID '%s'",
    numMatchingSnapshots, wapId);
```
I wonder if we should prevent this situation (two snapshots with the same WAP ID) from happening in the first place, at snapshot creation time?
I don't have a strong opinion here, but this might be considered a more significant / potentially breaking change? Technically having a duplicate WAP ID doesn't cause any problems until they are cherry-picked into main.
Do you think there might be legitimate uses for staging multiple changes under the same WAP ID? For example:
- staging multiple changes, evaluating all of them separately and then deleting all but one before committing.
- creating staged snapshots which are never intended to be published (for testing / evaluation / etc)
I am not super familiar with the original designs behind WAP in Iceberg, so I'll look through older commits to see if there's any mention of a uniqueness constraint.
spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/procedures/PublishChangesProcedure.java
Force-pushed from 7ed4b84 to df314e9.
singhpk234 left a comment:
LGTM, thanks @SamWheating!
Let's give it some time before we check it in, in case other folks have feedback on this.
Thanks @singhpk234! What's the preferred approach for applying this change to previous Spark versions? Should I wait until this is approved and merged before creating a single backport PR for all of them?
```diff
-    if (!wapSnapshot.isPresent()) {
-      throw new ValidationException("Cannot apply unknown WAP ID '%s'", wapId);
+    Iterable<Snapshot> wapSnapshots =
+        Iterables.filter(
```
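For context, the filter-based version roughly looked like the following sketch, reconstructed from the truncated diff excerpt above (the `wap.id` summary-key lookup and the surrounding control flow are assumptions, not the exact PR code):

```java
// Reconstructed sketch of the filter-based lookup (assumed names, not the exact PR code).
Iterable<Snapshot> wapSnapshots =
    Iterables.filter(
        table.snapshots(),
        snapshot -> wapId.equals(snapshot.summary().get("wap.id")));

int numMatchingSnapshots = Iterables.size(wapSnapshots);
if (numMatchingSnapshots == 0) {
  throw new ValidationException("Cannot apply unknown WAP ID '%s'", wapId);
} else if (numMatchingSnapshots > 1) {
  throw new ValidationException(
      "Cannot apply non-unique WAP ID. Found %d snapshots with WAP ID '%s'",
      numMatchingSnapshots, wapId);
}
Snapshot wapSnapshot = Iterables.getOnlyElement(wapSnapshots);
```

This enforces the uniqueness check, but as the nit below points out, `Iterables.size` walks the entire snapshot history even after a second match is found.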
nit: Instead of filtering all matching snapshots, could we scan table.snapshots() once and fail as soon as we see a 2nd match (avoid full-history scan)?
This is a good point; I've rewritten the procedure to exit early on the first conflicting snapshot.
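For illustration, a minimal sketch of what such an early-exit scan can look like (variable names, the error message wording, and the `wap.id` summary-key lookup are assumptions here, not the exact merged code):

```java
// Sketch of an early-exit scan over the snapshot history (assumed names, not the merged code).
Snapshot matchedSnapshot = null;
for (Snapshot snapshot : table.snapshots()) {
  if (wapId.equals(snapshot.summary().get("wap.id"))) {
    if (matchedSnapshot != null) {
      // Second match: fail immediately instead of scanning the rest of the history.
      throw new ValidationException(
          "Cannot apply non-unique WAP ID. Found multiple snapshots with WAP ID '%s'", wapId);
    }
    matchedSnapshot = snapshot;
  }
}

if (matchedSnapshot == null) {
  throw new ValidationException("Cannot apply unknown WAP ID '%s'", wapId);
}
```

Note that with an early exit the exact match count is no longer available for the error message, so the sketch reports "multiple" rather than a `%d` count.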
Since this changes the documented behavior of publish_changes, should the docs be updated as well?
Definitely, but in that case we should also ensure that all of the different Spark distributions are updated to be consistent with the docs (not just 4.1). Actually, maybe I should get some feedback on this code before I replicate it into 4 different places 😂 If the code + doc changes look good I will add the backports.
@huaxingao could you take a look at the updated procedure and let me know what you think? If this looks good I will make another commit to backport the procedure change to the other distributions.
Thanks for the reviews @huaxingao and @singhpk234, I have backported the fix to the other Spark versions now, so everything should be consistent with the updated documentation. Let me know if there's anything else I can do to help get this merged!
@SamWheating I only checked 4.1. Are the backports to 3.4, 3.5, and 4.0 clean backports?
Yes, aside from a minor difference in a return type, the changes to the tests are identical between versions and all versions passed.
Thanks @SamWheating for the PR! Thanks @singhpk234 for the review!
Closes: #14953 - see this issue for a larger description and reproduction.

It's assumed that `wap.id` will be unique among snapshots, but this doesn't appear to be enforced anywhere, which can lead to unexpected results when only the first write is actually published.

This PR updates the `publish_changes` procedure to fail when multiple matching snapshots are identified.

If this change is approved I will backport it to the other Spark versions.
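For reference, a hedged sketch of how the duplicate situation can arise in the standard WAP flow (the catalog, table, and WAP ID names here are made up, and the table is assumed to have `write.wap.enabled` set):

```java
import org.apache.spark.sql.SparkSession;

public class WapDuplicateRepro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().getOrCreate();

    // Stage two writes under the same WAP ID; each creates its own staged snapshot
    // carrying wap.id=wap-123 in its summary, and nothing prevents the duplicate.
    spark.conf().set("spark.wap.id", "wap-123");
    spark.sql("INSERT INTO demo.db.tbl VALUES (1)");
    spark.sql("INSERT INTO demo.db.tbl VALUES (2)");

    // Before this PR, publish_changes silently cherry-picked only the first match;
    // after this PR, the call fails with "Cannot apply non-unique WAP ID ...".
    spark.sql("CALL demo.system.publish_changes(table => 'db.tbl', wap_id => 'wap-123')");
  }
}
```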